Controlling the False Discovery Rate for Feature Selection in High-resolution NMR Spectra
Abstract
Successful feature selection in nuclear magnetic resonance (NMR) spectra not only improves classification ability but also simplifies the entire modeling process and thus reduces computational and analytical effort. Principal component analysis (PCA) and partial least squares (PLS) have been widely used for feature selection in NMR spectra. However, extracting meaningful metabolite features from the reduced dimensions obtained through PCA or PLS is complicated because these reduced dimensions are linear combinations of a large number of the original features. In this paper, we propose a multiple testing procedure that controls the false discovery rate (FDR) as an efficient method for feature selection in NMR spectra. The procedure compensates for this limitation of PCA and PLS and identifies the individual metabolite features necessary for classification. In addition, we present orthogonal signal correction to improve classification and visualization by removing unnecessary variation in NMR spectra. Our experimental results with real NMR spectra showed that classification models constructed with the features selected by our proposed procedure yielded smaller misclassification rates than models constructed with all features.
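The abstract does not say which FDR-controlling procedure the authors use. As an illustration only, the sketch below assumes the Benjamini-Hochberg step-up rule applied to one p-value per spectral feature (for example, from a two-group test at each chemical-shift bin). The function name, the level q, and the p-values are hypothetical and are not taken from the paper.

```python
import numpy as np


def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of features selected at FDR level q.

    Assumption: Benjamini-Hochberg step-up rule; the paper's exact
    procedure is not given in the abstract.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    if m == 0:
        return np.zeros(0, dtype=bool)
    order = np.argsort(p)                      # ranks p-values ascending
    thresholds = q * np.arange(1, m + 1) / m   # q * i / m for rank i
    below = p[order] <= thresholds
    selected = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up: find the largest rank that meets its bound, then
        # select every feature ranked at or before it.
        k = np.nonzero(below)[0].max()
        selected[order[: k + 1]] = True
    return selected


# Hypothetical p-values, one per spectral feature.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
selected = benjamini_hochberg(p_values, q=0.05)
```

With these made-up inputs only the first two features are selected. Unlike PCA or PLS components, the result is a mask over the original features, so each selected entry maps directly back to a position in the spectrum.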
Similar resources
Classification of High-Resolution NMR Spectra Based on Complex Wavelet Domain Feature Selection and Kernel-Induced Random Forest
High-resolution nuclear magnetic resonance (NMR) spectra contain important biomarkers that have potential for early diagnosis of disease and subsequent monitoring of its progression. Traditional feature extraction and analysis methods have been carried out in the original frequency spectrum domain. In this study, we conduct feature selection based on a complex wavelet transform by making use ...
The False Discovery Rate in Simultaneous Fisher and Adjusted Permutation Hypothesis Testing on Microarray Data
Background and Objectives: In recent years, new technologies have led to the production of large amounts of data, and in the field of biology, microarray technology has also developed dramatically. Meanwhile, the Fisher test is used to compare the control group with two or more experimental groups and also to detect the differentially expressed genes. In this study, the false discovery rate was investiga...
Bounding the False Discovery Rate in Local Bayesian Network Learning
Modern Bayesian Network learning algorithms are time-efficient, scalable and produce high-quality models; these algorithms feature prominently in decision support model development, variable selection, and causal discovery. The quality of the models, however, has often only been empirically evaluated; the available theoretical results typically guarantee asymptotic correctness (consistency) of t...
An effective method for controlling false discovery and false nondiscovery rates in genome-scale RNAi screens.
In most genome-scale RNA interference (RNAi) screens, the ultimate goal is to select siRNAs with a large inhibition or activation effect. The selection of hits typically requires statistical control of two errors: false positives and false negatives. Traditional methods of controlling false positives and false negatives do not take into account an important feature of RNAi screens: many small-in...
A Novel Architecture for Detecting Phishing Webpages using Cost-based Feature Selection
Phishing is one of the luring techniques used to exploit personal information. A phishing webpage detection system (PWDS) extracts features to determine whether it is a phishing webpage or not. Selecting appropriate features improves the performance of PWDS. Performance criteria are detection accuracy and system response time. The major time consumed by PWDS arises from feature extraction that ...
Journal title:
- Statistical Analysis and Data Mining
Volume 1, Issue 2
Pages: -
Publication date: 2008